Substring Statistics

Authors

  • Kyoji Umemura
  • Kenneth Ward Church
Abstract

The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams). Anything you can do with words, we ought to be able to do with substrings. This paper will show how to compute many statistics of interest for all substrings (ngrams) in a large corpus. The method not only computes standard corpus frequency, freq, and document frequency, df, but generalizes naturally to compute dfk(str), the number of documents that mention the substring str at least k times. dfk can be used to estimate the probability distribution of str across documents, as well as summary statistics of this distribution, e.g., mean, variance (and other moments), entropy and adaptation.
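To make the three quantities in the abstract concrete, here is a minimal brute-force sketch that computes freq, df, and dfk for a single substring over a small list of documents. It is not the authors' method, which is designed to cover all substrings of a large corpus at once; the function names `count_occurrences` and `substring_stats`, the `max_k` parameter, and the choice to count overlapping occurrences are all assumptions made for illustration.

```python
from typing import Dict, List


def count_occurrences(doc: str, s: str) -> int:
    """Count occurrences of s in doc, allowing overlaps (an assumption)."""
    if not s:
        raise ValueError("substring must be non-empty")
    count = 0
    start = doc.find(s)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = doc.find(s, start + 1)
    return count


def substring_stats(docs: List[str], s: str, max_k: int = 3) -> Dict[str, int]:
    """Return freq and df1..df{max_k} for substring s (hypothetical helper).

    freq is the total number of occurrences in the corpus;
    dfk is the number of documents containing s at least k times,
    so df1 is the ordinary document frequency df.
    """
    per_doc = [count_occurrences(d, s) for d in docs]
    stats = {"freq": sum(per_doc)}
    for k in range(1, max_k + 1):
        stats[f"df{k}"] = sum(1 for c in per_doc if c >= k)
    return stats


if __name__ == "__main__":
    corpus = ["abab", "ab", "xyz"]
    print(substring_stats(corpus, "ab"))
```

Running this once per substring is far too slow for every substring of a large corpus, which is the practicality problem the abstract says the paper addresses.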


Similar resources

Real World Performance of Approximate String Comparators for use in Patient Matching

Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate...


Fast and Sensitive Probe Selection for DNA Chips Using Jumps in Matching Statistics

The design of large scale DNA microarrays is a challenging problem. So far, probe selection algorithms must trade the ability to cope with large scale problems for a loss of accuracy in the estimation of probe quality. We present an approach based on jumps in matching statistics that combines the best of both worlds. This article consists of two parts. The first part is theoretical. We introduc...


Longest common substrings with k mismatches (arXiv:1409.1694v2 [cs.DS], 16 Mar 2015)

The longest common substring with k-mismatches problem is to find, given two strings S1 and S2, a longest substring A1 of S1 and A2 of S2 such that the Hamming distance between A1 and A2 is ≤ k. We introduce a practical O(nm) time and O(1) space solution for this problem, where n and m are the lengths of S1 and S2, respectively. This algorithm can also be used to compute the matching statistics...


Longest common substrings with k mismatches

The longest common substring with k-mismatches problem is to find, given two strings S1 and S2, a longest substring A1 of S1 and A2 of S2 such that the Hamming distance between A1 and A2 is ≤ k. We introduce a practical O(nm) time and O(1) space solution for this problem, where n and m are the lengths of S1 and S2, respectively. This algorithm can also be used to compute the matching statistics ...


Structural Properties of the String Statistics Problem

A suitably weighted index tree such as a B-tree or a suffix tree can be easily adapted to store, for a given string x and for all substrings w of x, the number of distinct instances of w along x. The storage needed is seen to be linear in the length of x; moreover, the whole statistics can itself be derived in linear time, off-line on a RAM. If the substring w has nontrivi...



Publication date: 2009